ABSTRACT

Potential-based reward shaping (PBRS) is an effective and 
popular technique to speed up reinforcement learning by 
leveraging domain knowledge. While PBRS is proven to 
always preserve optimal policies, its effect on learning speed 
is determined by the quality of its potential function, which, 
in turn, depends on both the underlying heuristic and the 
scale. Knowing which heuristic will prove effective requires
testing the options beforehand, and determining the appropriate
scale requires tuning, both of which introduce additional
sample complexity.

We formulate a PBRS framework that improves learning 
speed, but does not incur extra sample complexity. For this, 
we propose to simultaneously learn an ensemble of policies, 
shaped w.r.t. many heuristics and on a range of scales. The 
target policy is then obtained by voting. The ensemble needs 
to be able to efficiently and reliably learn off-policy:
requirements fulfilled by the recent Horde architecture, which
we take as our basis. We demonstrate empirically that (1) 
our ensemble policy outperforms both the base policy, and 
its single-heuristic components, and (2) an ensemble over a 
general range of scales performs at least as well as one with 
optimally tuned components. 

1. INTRODUCTION 

The powerful ability of reinforcement learning (RL) [25]
to find optimal policies tabula rasa is also the source of
its main weakness: infeasibly long running times. As the
problems RL tackles get larger, it becomes increasingly
important to leverage all possible knowledge about the domain
at hand. One paradigm to inject such knowledge into the
reinforcement learning problem is potential-based reward
shaping (PBRS) [20]. Aside from its repeatedly demonstrated
efficacy in increasing learning speed, the principal
strength of PBRS lies in its ability to preserve optimal
policies. Moreover, it is the only¹ reward shaping scheme that
is guaranteed to do so [20]. At the heart of PBRS methods
lies the potential function. Intuitively, it expresses the
“desirability” of a state, defining the shaping reward on a
transition to be the difference in potentials of the
transitioning states. States may be desirable by many criteria.
The pursuit of designing a potential function that accurately
encapsulates the “true” desirability is meaningless, as it would
solve the task at hand [20], and remove the need for learning
altogether. However, one can usually suggest many simple
heuristic criteria that improve performance in different
situations. Choosing the most effective heuristic amongst them
without a test comparison is typically infeasible, and carrying
out such a comparison implies added sample complexity
that may be unaffordable. Moreover, heuristics may
contribute complementary knowledge that cannot be leveraged
in isolation.

1 Given no knowledge of the environment dynamics.

The choice of a heuristic is merely one of the two deciding
factors for the performance of a potential function. The
other (and one that is even less intuitive) is scaling. An
effective heuristic with a sub-optimal scaling factor may make
no difference at all, if the factor is too small, or dominate
the base reward and distract the learner² if the factor is too
large. Typically, one is required to tune the scaling factor
beforehand, which requires extra environment samples, and
is infeasible in realistic problems.

We wish to devise a PBRS framework that is capable of
improving learning speed, without introducing extra sample
complexity. To this end, rather than learn a single policy
shaped with the most effective heuristic on its optimal
scale, we propose to maintain an ensemble of policies that all
learn from the same experience, but are shaped w.r.t. different
heuristics and different scaling factors. The deployment
of our ensemble thus does not require any additional
environment samples, and frees the designer up to benefit from
PBRS, equipped only with a set of intuitive heuristic rules,
with no necessary knowledge of their performance and value
magnitudes.

Because all member-policies learn to maximize different
reward functions from the same experience (so as not to
require extra environment samples), the learning
needs to be reliable off-policy. Because the introduced
computational complexity (for each of the additional
member-policies) amounts to that of the off-policy learner, we
wish for the learning to be as efficient as possible. The recently
introduced Horde architecture [26] is well-suited to be the
basis of our ensemble, due to its general off-policy
convergence guarantees and computational efficiency. In contrast
to the previous uses of Horde [51], we exploit its power to
learn a single task, but from multiple viewpoints.

The convergence guarantees of Horde require a latent learning
scenario [15], i.e. one of (off-policy) learning under a
fixed (or slowly changing) behavior policy. This scenario is
particularly relevant to real-world applications, where failure
is highly penalized and the usual trial-and-error tactic
is implausible, e.g. robotic setups. One could imagine the
agent following a safe exploratory policy, while learning the
target control policy, and only executing the target policy
after it is learnt. That is the scenario we focus on in this
paper. Note that the conventional interpretation of PBRS
as steering exploration [6] does not apply here, as the behavior
is unaffected by the target policy, and is kept fixed. This
work (and its precursor [8]) provides, to our knowledge, the
first validation that PBRS is effective in such a latent setting.

2 The agent will eventually still uncover the optimal policy,
but instead of helping it get there faster, reward shaping
would slow the learning down.

Our contribution is two-fold: (1) we formulate and empirically
validate a PBRS framework as a policy ensemble that
is capable of increasing learning speed without adding extra
sample complexity, and that does so with general convergence
guarantees. Specifically, we demonstrate how such an
ensemble can be used to lift the problems of both the choice
of the potential function and its scaling, thus removing the
need for behind-the-scenes tuning before deployment;
and (2) we validate PBRS to be effective in a latent
off-policy setting, in which it cannot steer the exploration
strategy.

In the following section we give an overview of the
preliminaries. Section 3 motivates our approach further, while
Section 4 describes the proposed architecture and the voting
techniques used to obtain the target ensemble policy.
Section 5 presents empirical results in two classical benchmarks,
and Section 6 concludes.

2. BACKGROUND 

We assume the usual RL framework [25], in which the
agent interacts with its (typically) Markovian environment
at discrete time steps $t = 1, 2, \ldots$. Formally, a Markov
Decision Process (MDP) [22] is a tuple
$M = (\mathcal{S}, \mathcal{A}, \gamma, T, R)$, where:
$\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of
actions, $\gamma \in [0,1]$ is the discounting factor,
$T = \{P_{sa}(\cdot) \mid s \in \mathcal{S}, a \in \mathcal{A}\}$
are the next state transition probabilities with $P_{sa}(s')$
specifying the probability of state $s'$ occurring upon taking
action $a$ from state $s$, and
$R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$
is the reward function with $R(s, a, s')$ giving the expected
value of the reward that will be received when $a$ is taken in
state $s$, and $r_{t+1}$ denoting the component of $R$ at time $t$.

A (stochastic) Markovian policy
$\pi : \mathcal{S} \times \mathcal{A} \to [0,1]$ is
a probability distribution over actions at each state, s.t.
$\pi(s, a)$ gives the probability of action $a$ being taken from
state $s$ under policy $\pi$. In the deterministic case, we will
take $\pi(s) = a$ to mean $\pi(s, a) = 1$.

Value-based methods encode policies through value functions,
which denote the expected cumulative reward obtained
while following the policy. We focus on state-action value
functions. In a discounted setting:

$$Q^\pi(s,a) = \mathbb{E}_{T,\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a\right] \qquad (1)$$
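As an illustration of the discounted sum inside Eq. (1), the following minimal sketch computes the return of one finite sampled trajectory. The function name `discounted_return` and the list-of-rewards input are assumptions made for this example; averaging such returns over many trajectories that start in $(s, a)$ and follow $\pi$ would give a Monte Carlo estimate of $Q^\pi(s,a)$.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_{t+1} for one finite list of sampled rewards.

    `rewards[t]` plays the role of r_{t+1} in Eq. (1); the name and the
    input format are illustrative assumptions, not part of the paper.
    """
    total = 0.0
    # Accumulate backwards using G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Three steps with a reward of -1 each, as in a short mountain car episode.
    print(discounted_return([-1.0, -1.0, -1.0], gamma=0.99))
```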

An action $a^*$ is greedy in a state $s$, if it is the action of
maximum value in $s$. A (deterministic) policy is greedy, if
it picks the greedy action in each state:

$$\pi(s) = \arg\max_a Q^\pi(s, a), \quad \forall s \in \mathcal{S} \qquad (2)$$

A policy $\pi^*$ is optimal if its value is largest:

$$Q^*(s,a) = \sup_\pi Q^\pi(s,a), \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}$$


The learning is on-policy if the behavior policy $\pi_b$ that the
agent is following is the same as the target policy $\pi$ that
the agent is evaluating. Otherwise, it is off-policy. Given
$\pi_b$, the values of the optimal greedy policy can be learned
incrementally through the following Q-learning [30] update:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \delta_t \qquad (3)$$

$$\delta_t = r_{t+1} + \gamma \max_{a^* \in \mathcal{A}} Q_t(s_{t+1}, a^*) - Q_t(s_t, a_t) \qquad (4)$$

where $Q_t$ is an estimate of $Q^*$ at time $t$, $\alpha_t \in (0,1)$
is the learning rate at time $t$, $a_t$ is chosen according to
$\pi_b$, and $\delta_t$ is the temporal-difference (TD) error of the
transition. $s_{t+1}$ is drawn according to $T$, given $s_t$ and
$a_t$, and $a^*$ is the greedy action w.r.t. $Q_t$ in $s_{t+1}$.
Given a tabular representation, this process is shown to converge
to the correct value estimates (the TD-fixpoint) in the limit
under standard approximation conditions [9].
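The tabular update of Eqs. (3) and (4) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `q_learning_step`, the dictionary-based table and the string state and action labels are all assumptions, and terminal states are not treated specially.

```python
from collections import defaultdict


def q_learning_step(Q, s, a, r, s_next, actions, alpha, gamma):
    """Apply Eqs. (3)-(4) to one transition and return the TD error.

    Q is a mapping from (state, action) pairs to value estimates;
    all names here are illustrative assumptions.
    """
    # Value of the greedy action in the next state (the max in Eq. 4).
    best_next = max(Q[(s_next, b)] for b in actions)
    delta = r + gamma * best_next - Q[(s, a)]
    # Eq. (3): move the estimate along the TD error.
    Q[(s, a)] += alpha * delta
    return delta


if __name__ == "__main__":
    Q = defaultdict(float)  # all estimates start at zero
    d = q_learning_step(Q, "s0", "right", -1.0, "s1",
                        ["left", "right"], alpha=0.1, gamma=0.99)
    print(d, Q[("s0", "right")])
```

Because the action `a` is supplied by the caller, the same step applies whichever behavior policy generated the transition, which is what makes the update off-policy.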

When the state or action spaces are too large, or continuous,
tabular representations do not suffice and one needs to
use function approximation (FA). The state (or state-action)
space is then represented through a set of features $\phi$, and
the algorithms learn the value of a parameter vector $\theta$. In
the (common) linear case:

$$Q_t(s,a) = \theta_t^\top \phi_{s,a}, \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A} \qquad (5)$$

and Eq. (3) becomes:

$$\theta_{t+1} = \theta_t + \alpha_t \delta_t \phi_t, \qquad (6)$$

where we slightly abuse notation by letting $\phi_t$ denote the
state-action features $\phi_{s_t,a_t}$, and $\delta_t$ is still
computed according to Eq. (4).

In the next two subsections we present the core ingredients 
to our approach. 

2.1 Horde 

FA is known to cause off-policy bootstrapping methods
(such as Q-learning) to diverge even on simple problems
[2, 28]. The family of gradient temporal difference (GTD)
methods provides a solution for this issue, and guarantees
off-policy convergence under FA, given a fixed (or slowly
changing) behavior policy [26]. Previously, similar guarantees
were provided only by second-order batch methods (e.g. LSTD [3]),
unsuitable for online learning. GTD methods are the first
to maintain these guarantees, while keeping the (time
and space) complexity linear in the size of the state space.
Note that linearity is a lower bound on what is achievable,
because it is required to simply store and access the learning
vectors. As a consequence, GTD methods scale well with
the number of value functions (policies) learnt [19], and due
to the inherent off-policy setting, can do so from a single
stream of environment interactions (or experience). Sutton
et al. [27] formalize this idea in a framework of parallel
off-policy learners, called Horde. They demonstrate Horde to be
able to learn thousands of predictive and goal-oriented value
functions in real-time from a single unsupervised stream of
sensorimotor experience. There have been further successful
applications of Horde in realistic robotic setups [21].

On the technical level,³ GTD methods are based on the
idea of performing gradient descent on a reformulated
objective function, which ensures convergence to the projected
TD-fixpoint, by introducing a gradient bias into the
TD-update [26]. Mechanistically, it requires maintaining and
learning a second set of weights $w$, along with $\theta$, and
performing the following updates:

$$\theta_{t+1} = \theta_t + \alpha_t \left(\delta_t \phi_t - \gamma \phi'_t (\phi_t^\top w_t)\right) \qquad (7)$$

$$w_{t+1} = w_t + \beta_t (\delta_t - \phi_t^\top w_t) \phi_t \qquad (8)$$

where $\delta_t$ is still computed with Eq. (4), and $\phi'_t$ is the
feature vector of the next state and action. This is a simpler
form of the GTD-update, namely that of TDC [26].
GQ($\lambda$) [14] augments this update with eligibility traces.

3 Please refer to Maei’s dissertation for the full details
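The two coupled updates of Eqs. (7) and (8) can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the learner used in the paper: the names `dot` and `tdc_step` are hypothetical, vectors are plain lists of floats, the TD error is computed from the linear estimates of Eq. (5), and the caller is assumed to pass the features of the next state paired with the greedy action. Eligibility traces, as used by GQ($\lambda$), are omitted.

```python
def dot(u, v):
    """Inner product of two equally long lists of floats."""
    return sum(x * y for x, y in zip(u, v))


def tdc_step(theta, w, phi, phi_next, r, alpha, beta, gamma):
    """One TDC update (Eqs. 7-8); returns (new_theta, new_w, delta).

    phi      -- features of (s_t, a_t)
    phi_next -- features of the next state and (assumed greedy) action
    All names are illustrative assumptions.
    """
    # TD error under the linear value estimate theta^T phi.
    delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_w = dot(phi, w)
    # Eq. (7): TD step plus the gradient correction term.
    new_theta = [th + alpha * (delta * p - gamma * pn * phi_w)
                 for th, p, pn in zip(theta, phi, phi_next)]
    # Eq. (8): the secondary weights track the expected TD error.
    new_w = [wi + beta * (delta - phi_w) * p for wi, p in zip(w, phi)]
    return new_theta, new_w, delta


if __name__ == "__main__":
    theta, w, delta = tdc_step([0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                               r=-1.0, alpha=0.1, beta=0.01, gamma=0.99)
    print(theta, w, delta)
```

Each step costs a constant number of passes over the feature vector, which is the linear per-step complexity the text refers to.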

Convergence is one of the two theoretical hurdles with
off-policy learning under FA. The other has to do with the
quality of solutions under off-policy sampling, which may, in
general, fall far from optimum, even when the approximator can
represent the true value function well. In, to our knowledge,
the only work that addresses this issue, Kolter [10] gives
a way of constraining the solution space to achieve stronger
qualitative guarantees, but his algorithm has quadratic
complexity and thus is not scalable. Since scalability is crucial in
our framework, Horde remains the only plausible convergent
architecture available.

2.2 Reward Shaping 

Reward shaping augments the true reward signal $R$ with
an additional shaping reward $F$, provided by the designer.
The shaping reward is intended to guide the agent, when
the environmental rewards are sparse or uninformative, in
order to speed up learning. In its most general form:

$$R' = R + F \qquad (9)$$

Because tasks are identified by their reward function,
modifying the reward function needs to be done with care, in
order to not alter the task, or else reward shaping can slow
down or even prevent finding the optimal policy [23]. Ng et
al. [20] show that grounding the shaping rewards in state
potentials is both necessary and sufficient for ensuring
preservation of the (optimal) policies of the original MDP.
Potential-based reward shaping (PBRS) maintains a potential
function $\Phi : \mathcal{S} \to \mathbb{R}$, and defines the
auxiliary reward function $F$ as:

$$F(s,a,s') = \gamma \Phi(s') - \Phi(s) \qquad (10)$$

where $\gamma$ is the discounting factor of the MDP. We refer to
the rewards, value functions and policies augmented with
shaping rewards as shaped. Shaped policies converge to the
same (optimal) policies as the base learner, but differ during
the learning process.
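The shaping reward of Eq. (10) is simple to compute, and its key property can be checked numerically: along any trajectory the discounted shaping rewards telescope to $\gamma^T \Phi(s_T) - \Phi(s_0)$, which depends only on the end points. The sketch below assumes hypothetical names (`shaping_reward`, `discounted_shaping_sum`), a potential passed as a Python callable, and an optional multiplicative `scale` standing in for the scaling factor discussed in the introduction.

```python
def shaping_reward(phi, s, s_next, gamma, scale=1.0):
    """Eq. (10), optionally scaled: scale * (gamma * Phi(s') - Phi(s)).

    `phi` is any callable potential; names are illustrative assumptions.
    """
    return scale * (gamma * phi(s_next) - phi(s))


def discounted_shaping_sum(phi, states, gamma):
    """Discounted sum of shaping rewards along a list of visited states."""
    total = 0.0
    for t in range(len(states) - 1):
        total += gamma ** t * shaping_reward(phi, states[t], states[t + 1], gamma)
    return total


if __name__ == "__main__":
    phi = lambda s: float(s) ** 2      # an arbitrary potential for the demo
    states = [0, 3, 1, 4]
    gamma = 0.9
    lhs = discounted_shaping_sum(phi, states, gamma)
    rhs = gamma ** 3 * phi(states[-1]) - phi(states[0])
    print(lhs, rhs)  # the two values agree up to rounding
```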

3. A HORDE OF SHAPINGS 

The key insight in ensemble learning is that the strength
of an ensemble lies in the diversity its components
contribute [11]. In the RL context, this diversity can be
expressed through several aspects, related to dimensions of the
learning process: (1) diversity of experience, (2) diversity of
algorithms and (3) diversity of reward signals. Diversity of
experience naturally implies high sample complexity, and
assumes either a multi-agent setup, or learning in stages.
Diversity of algorithms (given the same experience) is
computationally costly, as it requires separate representations,
and one needs to be particular about the choice of algorithms
due to convergence considerations.⁴ In the context
of our aim of increasing learning speed, without introducing
complexity elsewhere, we focus on the last aspect of
diversity: diversity of reward signals.

PBRS is an elegant and theoretically attractive approach
to introducing diversity into the reward function, by drawing
from the available domain knowledge. Such knowledge can
often be described as a set of simple heuristics. Combining
the corresponding potentials beforehand naively (e.g. with
linear scalarization) may result in information loss, when the
heuristics counterweigh each other, and introduce further
scaling issues, since the relative magnitudes of the potential
functions may differ. Maintaining the shapings separately
has recently been shown to be a more robust and effective
approach [4]. Under the requirements of convergence and
efficiency, maintaining such an ensemble of policies learning
in parallel and shaped with different potentials is only
possible via the Horde architecture, which is the approach we
take in this paper. Thus, the proposed ensemble is the first
of its kind to possess general convergence guarantees.

Horde’s demonstrated ability to learn thousands of policies
in parallel in real time [27, 19] allows one to consider large
ensembles, at little computational cost. While defining
thousands of distinct heuristics is rarely sensible, each heuristic
may be learnt on many different scaling factors. This not
only frees one from having to tune the scaling factor a priori
(one of the issues we focus on in this paper), but potentially
allows for automatic dynamic scaling, corresponding to
state-dependent shaping magnitudes.

Shaping Off-Policy 

The effects of PBRS on the learning process are usually
considered to lie in the guidance of exploration during
learning [6, 17, 20]. Laud and DeJong [12] formalize this by
showing that the difficulty of learning is most dependent on
the reward horizon, a measure of the number of decisions
a learning agent must make before experiencing accurate
feedback, and that reward shaping artificially reduces this
horizon. In our latent setting we assume no control over
the agent’s behavior. The performance benefits then can be
explained by faster knowledge propagation through the TD
updates, which we now observe decoupled from guidance of
exploration.

Reward shaping in such off-policy settings is not well
studied or understood, and these effects are of independent
interest.


4. ARCHITECTURE 

We are now ready to describe the architecture of our
ensemble (Fig. 1). We maintain our Horde of shapings as a
set $\mathcal{D}$ of Greedy-GQ($\lambda$) learners [14]. Given a
set of potential functions
$\boldsymbol{\Phi} = \{\Phi_1, \ldots, \Phi_\ell\}$, a range of
scaling factors $c^i = (c^i_1, \ldots, c^i_{k_i})$ for each
$\Phi_i$, and the base reward function $R$, the ensemble reward
function is a vector:

$$\mathbf{R} = \left(R + F^{\Phi_1}_{c^1_1}, \ldots, R + F^{\Phi_\ell}_{c^\ell_{k_\ell}}\right) \qquad (11)$$

where $F^{\Phi_i}_{c^i_j}$ is the potential-based shaping reward
given by Eq. (10) w.r.t. the potential function $\Phi_i$ and scaled
with the factor $c^i_j$. For notational clarity, we will take
$F^i_j$ to mean $F^{\Phi_i}_{c^i_j}$ (i.e. the shaping w.r.t. the
$i$-th potential function on the $j$-th scaling factor), and
$R^i_j = R + F^i_j$. We allow the
ensemble the option to include the base learner.

4 See the discussion on convergence in Section 6.1.2 of van
Hasselt’s dissertation [29].
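The ensemble reward vector described above can be sketched as follows: every observed transition is turned into one reward per demon, each being the base reward plus a shaping term for one potential at one scale. The function name `ensemble_rewards` and the list-of-callables representation are assumptions made for illustration; the scale is applied multiplicatively to the shaping reward of Eq. (10).

```python
def ensemble_rewards(r, s, s_next, gamma, potentials, scales, include_base=True):
    """Return one reward per demon for a single transition (s, s_next).

    potentials -- list of callables Phi_i(state)
    scales     -- list of lists; scales[i] holds the factors used with Phi_i
    All names are illustrative assumptions.
    """
    rewards = [r] if include_base else []
    for phi, factors in zip(potentials, scales):
        # Unscaled shaping reward of Eq. (10) for this potential.
        f = gamma * phi(s_next) - phi(s)
        # One shaped reward R + c * F per scaling factor.
        rewards.extend(r + c * f for c in factors)
    return rewards


if __name__ == "__main__":
    potentials = [lambda s: s, lambda s: s * s]
    scales = [[1, 10], [1, 10]]
    print(ensemble_rewards(-1.0, 0.0, 1.0, 1.0, potentials, scales))
```

With three potentials, five scales each and the base learner included, the returned list has 16 entries, one per demon.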

We adopt the terminology of Sutton et al. [27], and refer
to individual agents within Horde as demons. Each demon
$d^i_j$ learns a greedy policy $\pi^i_j$ w.r.t. its reward
$R^i_j$. Recall that our latent setting implies that the learning
is guided by a fixed behavior policy $\pi_b$, with $\pi^i_j$ all
learning in parallel from the experience generated by $\pi_b$.
Because each policy $\pi^i_j$ is available separately at each
step, an ensemble policy can be devised by collecting votes on
action preferences from all demons $d^i_j$. The ensemble is also
latent, and not executed until the learning has ended. Note that
because PBRS preserves all of the optimal policies from the
original problem [20], the ensemble policy does too.

In this paper we have considered two voting schemes:
majority voting and rank voting, which are elaborated below.
The architecture is certainly not limited to these choices.

4.1 Ensemble Policy 

To the best of our knowledge, both voting methods were
first used in the context of RL agents by Wiering and Van
Hasselt [31]. In both methods, each demon $d$ casts a vote
$v_d : \mathcal{S} \times \mathcal{A} \to \mathbb{N}_0$, s.t.
$v_d(s, a)$ is the preference value of action $a$ in state $s$.
The voting scheme then is defined for policies, rather than value
functions, which mitigates the magnitude bias.⁵ The ensemble
policy acts greedily (with ties broken randomly) w.r.t. the
cumulative preference values $P$:

$$P(s_t, a) = \sum_{d \in \mathcal{D}} v_d(s_t, a), \quad \forall a \in \mathcal{A} \qquad (12)$$

The voting scheme determines the manner in which $v_d$ are
assigned.
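The reason for voting on policies rather than on values can be made explicit with a short derivation. It uses only Eq. (1) and Eq. (10), and assumes that the potential is scaled by a factor $c$, that $\Phi$ is bounded and that $\gamma < 1$ (so that the boundary term vanishes); the subscript $c$ on $Q$ is notation introduced here for the shaped value function.

```latex
\begin{align*}
Q^\pi_{c}(s,a)
  &= \mathbb{E}_{T,\pi}\Big[\sum_{t=0}^{\infty} \gamma^t
     \big(r_{t+1} + c\,\gamma\Phi(s_{t+1}) - c\,\Phi(s_t)\big)
     \,\Big|\, s_0 = s, a_0 = a\Big] \\
  &= Q^\pi(s,a) + c\,\mathbb{E}_{T,\pi}\Big[\lim_{N\to\infty}
     \gamma^{N}\Phi(s_{N}) - \Phi(s_0) \,\Big|\, s_0 = s\Big] \\
  &= Q^\pi(s,a) - c\,\Phi(s).
\end{align*}
```

The offset $c\,\Phi(s)$ does not depend on the action, so the greedy action is unchanged, but the magnitudes of the value estimates of demons with different potentials or scales are not directly comparable. Summing raw values across demons would therefore favor demons with large offsets, whereas summing votes does not.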

Majority voting Each demon $d$ casts a vote of 1 for its
most preferred action, and a vote of 0 for the others.
I.e.:

$$v_d(s,a) = \begin{cases} 1 & \text{if } Q_d(s,a) = \max_{a^*} Q_d(s,a^*) \\ 0 & \text{otherwise.} \end{cases} \qquad (13)$$

Rank voting Each demon greedily ranks its $n$ actions, from
$n - 1$ for its most, to 0 for its least preferred action.
We slightly modify the formulation from [31], by
ranking Q-values, instead of policy probabilities. I.e.
$v_d(s,a) > v_d(s,a')$, if and only if $Q_d(s,a) > Q_d(s,a')$.
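Both voting schemes and the greedy ensemble policy of Eq. (12) can be sketched in a few lines. The names `majority_votes`, `rank_votes` and `ensemble_action` are hypothetical, each demon is represented simply by its list of Q-values in the current state, and the treatment of tied Q-values in rank voting (tied actions receive equal votes) is an assumption of this sketch.

```python
import random


def majority_votes(q_values):
    """Eq. (13): 1 for every action of maximal value, 0 otherwise."""
    best = max(q_values)
    return [1 if q == best else 0 for q in q_values]


def rank_votes(q_values):
    """Rank voting: an action's vote is the number of strictly worse actions.

    With n distinct values this yields n-1 for the best action down to 0.
    """
    return [sum(1 for other in q_values if other < q) for q in q_values]


def ensemble_action(demon_q_values, vote_fn, rng=random):
    """Eq. (12): sum votes over demons, act greedily, break ties randomly.

    Returns the chosen action index and the cumulative preferences.
    """
    prefs = [0] * len(demon_q_values[0])
    for q_values in demon_q_values:
        for a, v in enumerate(vote_fn(q_values)):
            prefs[a] += v
    best = max(prefs)
    action = rng.choice([a for a, p in enumerate(prefs) if p == best])
    return action, prefs


if __name__ == "__main__":
    # Three demons, three actions; each demon prefers a different action.
    qs = [[1.0, 2.0, 3.0], [5.0, 4.0, 0.0], [0.0, 2.0, 1.0]]
    print(ensemble_action(qs, majority_votes)[1])  # a three-way tie
    print(ensemble_action(qs, rank_votes))         # rank voting resolves it
```

In this small example majority voting produces a three-way tie, while rank voting uses the demons' second preferences to single out one action.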

5. EXPERIMENTS 

We now present the empirical studies that validate the
efficacy of our ensemble architecture w.r.t. both the choice of
heuristic and the choice of scale. We first consider the
scenario of choosing between heuristics, and evaluate an
ensemble consisting of shapings with appropriate scaling factors.
The experiments show that the ensemble policy performs
at least as well as the best heuristic. We then turn to the
problem of scaling, and demonstrate that ensembles on both
narrow and broad ranges of scales perform at least as well
as the one w.r.t. the optimal scaling factors.

5 Note that even though the shaped policies are the same
upon convergence, the value functions are not.

Figure 1: An overview of the Horde architecture used to learn
an ensemble of shapings (including the base learner). Vectors
are indicated with bold lines. $R^i_j$ is the reward obtained when
applying $\Phi_i$ to $R$ and scaling with $c^i_j$. The blue output
of the linear function approximation block are the features of the
transition (two state-action pairs), with their intersections with
$\theta^i_j$ representing weights. $a'$ is a vector of greedy
actions at $s'$ w.r.t. each policy $\pi^i_j$. Note that in this
latent setting, all interactions with the environment happen only
in the upper left corner.

We carry out our experiments on two common benchmark
problems. In both problems, the behavior policy is a uniform
distribution over all actions at each time step. The
evaluation is done by interrupting the base learner every $z$
episodes and executing the queried greedy policy once. No
learning is allowed during evaluation.

We evaluated the ensembles w.r.t. both voting schemes
from Sec. 4.1 and found the (sum) performance to be not
significantly different (p > 0.05), with rank voting performing
slightly better. To keep the clarity of focus, below we
only present the results for the rank voting scheme, but
emphasize that the performance is not conditional on this
choice.

5.1 Mountain Car 

We begin with the classical benchmark domain of mountain
car [25]. The task is to drive an underpowered car up
a hill. The (continuous) state of the system is composed of
the current position (in [−1.2, 0.6]) and the current velocity
(in [−0.07, 0.07]) of the car. Actions are discrete, a throttle
of {−1, 0, 1}. The agent starts at the position −0.5 and a
velocity of 0, and the goal is at the position 0.6. The
rewards are −1 for every time step. An episode ends when
the goal is reached, or when 2000 steps have elapsed. The
state space is approximated with the standard tile-coding
technique [25], using ten tilings of 10 × 10, with a parameter
vector learnt for each action.

In this domain we define three intuitive shaping potentials:

Position Encourage progress to the right (in the direction
of the goal). This potential is flawed by design, since
in order to get to the goal, one needs to first move
away from it.

$$\Phi_1(\mathbf{x}) = \bar{x} \qquad (14)$$

Height Encourage higher positions (potential energy):

$$\Phi_2(\mathbf{x}) = \bar{h} \qquad (15)$$

Speed Encourage higher speeds (kinetic energy):

$$\Phi_3(\mathbf{x}) = \overline{|\dot{x}|^2} \qquad (16)$$

Here $\mathbf{x} = (x, \dot{x})$ is the state (position and
velocity), $h$ is the height of the car at position $x$, and
$\bar{a}$ denotes the normalization of $a$ onto $[0, 1]$.
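The three potentials above can be sketched as plain functions of the state. The state bounds are those given in the domain description; everything else is an assumption of this sketch: the function names, the linear min-max normalization, and in particular the hill profile $\sin(3x)$ used for the height, which is the shape commonly associated with this benchmark but is not stated in the text.

```python
import math

# State bounds from the domain description.
X_MIN, X_MAX = -1.2, 0.6
V_MAX = 0.07


def normalize(value, low, high):
    """Linear (min-max) normalization onto [0, 1]; an assumed choice."""
    return (value - low) / (high - low)


def phi_position(x, v):
    """Eq. (14): normalized position."""
    return normalize(x, X_MIN, X_MAX)


def phi_height(x, v):
    """Eq. (15): normalized height.

    Assumption: the hill profile is sin(3x), which spans [-1, 1]
    on the position range used here.
    """
    return normalize(math.sin(3.0 * x), -1.0, 1.0)


def phi_speed(x, v):
    """Eq. (16): normalized squared speed."""
    return normalize(v * v, 0.0, V_MAX * V_MAX)


if __name__ == "__main__":
    start = (-0.5, 0.0)  # the start state of the task
    print(phi_position(*start), phi_height(*start), phi_speed(*start))
```

At the start state the speed potential is zero and the height potential is close to its minimum, which matches the intuition that the car begins near the bottom of the valley at rest.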

We used $\gamma = 0.99$. The learning parameters were tuned
w.r.t. the base learner and shared among all demons:
$\lambda = 0.4$, $\beta = 0.0001$, $\alpha = 0.1$, where $\lambda$
is the trace decay parameter, $\beta$ the step size for the second
set of weights $w$ in Greedy-GQ, and $\alpha$ the step size for
the main parameter vector $\theta$. We ran 1000 independent runs
of 100 episodes each, with evaluation occurring every 5 episodes
($z = 5$).



Figure 2: Learning curves of the single shapings and their
ensembles in mountain car. $E_1$, the ensemble of two comparable
shapings, outperforms both of them, whereas $E_2$, the ensemble
of all three shapings, matches (p > 0.05) the performance of the
(more effective) third shaping $d_3$.

5.1.1 Choice of Heuristic 

In this experiment]^] we address the question of the choice 
between heuristics. We thus consider ensembles composed 
of the demons shaped with the three shaping potential func¬ 
tions 4 >i ,$2 and 4 > 3 , and scaled with factors 01 , 02,03 that 
have been tuned beforehand. We associate the learner di 
with dff 

When evaluating the shapings individually, we witness d 3 
to perform best amongst the three. To examine the quality 
of our ensembles w.r.t. the quality of its components, we 

6 This experiment first appeared in the early version of this 
work [§]. 


consider two scenarios: E\ — (dijcfe) of two demons and 
E 2 — (di,d 2 ,cfe) of three demons. This corresponds to hav¬ 
ing ensemble consisting of two comparable shapings, and an 
ensemble with one clearly most efficient shaping. Thus, ide¬ 
ally, we would like E\ to outperform both d\ and d 2 and E 2 
to at least match the performance of d 3 . 

Fig. 2 presents the learning performance of the base agent,
the demons $d_1$, $d_2$, $d_3$ shaped with single potentials, and
the two ensembles $E_1$ and $E_2$, mentioned above. We witness the
individual shapings alone to aid the learning significantly.

$E_1$ follows $d_1$ at first, when its performance is better, but
switches to $d_2$ when the performance of $d_1$ levels out. This
is because $d_1$ (as is appropriate with its position shaping)
persists on going right in the beginning of an episode, and
this strategy, while effective at first, results in a plateau of
a higher number of steps. The ensemble policy is able to
avoid this by incorporating information from $d_2$.

$E_2$, the ensemble of all three shapings, begins better than
both $d_1$ and $d_2$, but slightly worse than $d_3$, the most
effective shaping. It, however, quickly catches up to $d_3$, with
the overall performance of $E_2$ and $d_3$ being statistically
indistinguishable.

Thus, the performance of the ensembles meets our desiderata:
when there is clearly a best component, an ensemble
statistically matches it, otherwise it outperforms all of its
components.

5.1.2 Choice of Scale 

The previous set of experiments assumed access to the
best scaling factors $c_1, c_2, c_3$. In practice obtaining these
requires tuning each shaping prior to the use of the ensemble,
a scenario we aim to avoid. In this section we demonstrate
that ensembles on a range of scales perform at least as well
as those with cherry-picked components.

Namely, we consider two scaling ranges
$C_1 = \{20, 40, 60, 80, 100\}$ and
$C_2 = \{1, 10, 10^2, 10^3, 10^4\}$, with the first being a
reasonably close range to the optimal scales from the previous
section, and the second being a general sweep, with no intuition
or knowledge of the optimal scale. Before we proceed
further, we illustrate the effect a scaling factor can have on
the performance of a single shaping. Fig. 3 gives a comparison
of the performance of the shaping potential $\Phi_2$ over
the (reasonable) scaling range $C_1$. Even small differences in
scale have a dramatic effect on the shaping’s performance.

Now let $E_{C_1}$ and $E_{C_2}$ be the ensembles w.r.t. all three
shapings on $C_1$ and $C_2$, resp., each totaling 16 demons
(including the base learner). We compare $E_{C_1}$ and $E_{C_2}$
with $E_2$ (the ensemble w.r.t. the three shapings with tuned
scaling factors, from the first experiment). We illustrate the
range of performances of shapings for each scale range, by
additionally plotting the average of the runs of each shaping
across each scale. I.e. for the range $C_j$ and shaping $\Phi_i$,
at each episode, this is the average of the rewards obtained by
the demons $d^i_1, d^i_2, \ldots, d^i_{|C_j|}$ in that episode.

Fig. 4 presents the results. $E_{C_1}$ and $E_{C_2}$ are both
statistically the same (p > 0.05) as the tuned ensemble $E_2$,
despite their components having a much wider range of
performance.

Figure 3: The range of performance of a single shaping w.r.t.
different scales in mountain car. Each curve corresponds to the
performance of a demon shaped with $\Phi_2$, with a scaling factor
from the range $C_1$.

Figure 4: Learning curves of the ensembles over the scale ranges
$C_1$ and $C_2$ in mountain car. The solid and dashed lines (for
each of the three shapings) are the mean performance of the demons
w.r.t. $C_1$ and $C_2$, respectively, and are plotted as a
reference for the performance of the ensemble components. Note
that there is no single demon with this performance. The
performances of ensembles $E_{C_1}$ and $E_{C_2}$ are not
significantly different from that of $E_2$: the ensemble w.r.t.
tuned components.

5.2 Cart-Pole

We now validate our framework on the problem of cart-pole
[18]. The task is to balance a pole on top of a moving
cart for as long as possible. The (continuous) state $\mathbf{s}$
contains the angle $\xi$ and angular velocity $\dot{\xi}$ of the
pole, and the position $x$ and velocity $\dot{x}$ of the cart.
There are two actions: a small positive and a small negative force
applied to the cart. The pole falls when $|\xi|$ exceeds a
threshold angle, which terminates the episode. The track is
bounded within [−4, 4], but the sides are “soft”; the cart does
not crash upon hitting them. The reward function penalizes a pole
drop, and is 0 elsewhere. An episode terminates successfully if
the pole was balanced for 1000 steps. The state space is
approximated with tile coding, using ten tilings of 10 × 10 over
all 4 dimensions, with a parameter vector learnt for each action.

We define two potential functions, corresponding to the
angle and angular speed of the pole.

Angle Discourage angles far from the equilibrium:

$$\Phi_1(\mathbf{s}) = -|\xi|^2 \qquad (17)$$

Angular speed Discourage high speeds (which are likelier
to result in dropping the pole):

$$\Phi_2(\mathbf{s}) = -|\dot{\xi}|^2 \qquad (18)$$

We used $\gamma = 0.99$. The learning parameters were tuned
w.r.t. the base learner and set to $\lambda = 0.7$,
$\alpha = 0.1$ and $\beta = 0.001$. These settings were shared
among all demons. We ran 100 independent runs of 1000 episodes
each, with evaluation occurring every 50 episodes ($z = 50$).

5.2.1 Choice of Heuristic and Scale 

In this experiment we evaluate the problems of the choice
of the heuristic and its scale jointly. We consider a general
scaling range $C = \{1, 10, 10^2, 10^3, 10^4\}$, and three
ensembles: $E^1_C$ resp. $E^2_C$, comprised only of the demons
shaped w.r.t. $\Phi_1$ resp. $\Phi_2$ across $C$ (5 demons each),
and $E_C$ containing all 11 demons (including the base learner).
As before, we illustrate the range of performances of shapings
across the range of scales by, for each shaping, plotting the
average performance of the demons w.r.t. that shaping across the
entire scale range. I.e. for the shaping $\Phi_i$, at each
episode, this is the average of the rewards obtained by the demons
$d^i_1, d^i_2, \ldots, d^i_{|C|}$ in that episode.



[Figure 5 plot; x-axis: Episodes / 50]

Figure 5: Learning curves for the ensembles $E^1_C$, $E^2_C$ and
$E_C$ in cart-pole. The dashed lines (for each of the two
shapings) denote the mean performance of the demons w.r.t. $C$,
and are plotted as a reference for the performance of the ensemble
components. Note that there is no single demon with this
performance. The performance of the global ensemble $E_C$
follows the (more effective) first shaping, in the end matching
the performance of the corresponding ensemble $E^1_C$.

Fig. 5 shows the results. All ensembles (and ensemble
averages) improve over the base learner. The performance of
$E^2_C$, the ensemble over the second shaping, matches that of
the average from that ensemble, since all of its components
perform similarly. On the other hand, $E^1_C$, the ensemble over
the first shaping, does much better than the corresponding
average. The global ensemble $E_C$ over all of the demons
starts out better than both $E^1_C$ and $E^2_C$, then levels at
the average performance of the (better) first shaping, and finally
matches the performance of $E^1_C$. The global ensemble $E_C$
thus correctly identifies both which shaping to follow (its
performance always follows, or is better than, that of the
more efficient first shaping, either on average or the ensemble
$E^1_C$), and on what scales (the final performance of $E_C$
matches that of $E^1_C$, significantly improving over the average
across the scale range).

6. CONCLUSIONS 

In this work we described a novel off-policy PBRS ensemble
architecture that is able to improve learning speed in a
latent setting, without requiring the extra sample complexity
introduced by the steps of tuning the heuristic and its
scale, typical to PBRS. We avoid these steps by learning an
ensemble of policies w.r.t. many heuristics and scaling factors
simultaneously. Our ensemble possesses general convergence
guarantees, while staying efficient, as it leverages the
recent Horde architecture to learn a single task well. Our
experiments validate the use of PBRS in the latent setting, and
demonstrate the efficacy of the proposed ensemble. Namely,
we show that the ensemble policy over both broad and narrow
ranges of scales performs at least as well as the one
over a set of optimally pre-tuned components, which in turn
performs at least as well as its best component-heuristic.


Future Directions 

In this work we have assumed a shared set of parameters
between the demons. An immediate extension would be to
maintain demons that learn w.r.t. different parameters.
This is similar to the approach of Marivate and Littman [16],
who learn to solve many variants of a problem for the best
parameter settings in a generalized MDP. In our case the
MDP (dynamics) will remain shared, but the individual
parameters of the demons will vary.

It would be worthwhile to evaluate the framework w.r.t.
different ensemble techniques that induce the target ensemble
policy. This would be especially useful in domains where
only select scaling factors of select heuristics offer
improvement: taking a global majority vote over such an ensemble
will likely not be as effective as trying to determine which
subset of demons to consider. One could, e.g., use confidence
measures [2] to identify these demons.

Instead of shaping demons with static potential functions,
one could consider maintaining a layer of demons that each
learn some potential function [17, 7], which are, in turn, fed
into the layer of shaped demons who contribute to the
ensemble policy. One needs to be realistic about attainability
of learning this in time, since as argued by Ng et al. [20],
the best potential function correlates with the optimal value
function $V^*$, learning which would solve the base problem
itself and render the potentials pointless.